End-to-end speech synthesis based on WaveNet
QIU Zeyu, QU Dan, ZHANG Lianhai
Journal of Computer Applications    2019, 39 (5): 1325-1329.   DOI: 10.11772/j.issn.1001-9081.2018102131
The Griffin-Lim algorithm is widely used for phase estimation in end-to-end speech synthesis, but it often produces noticeably artificial speech with low fidelity. To address this problem, an end-to-end speech synthesis system based on the WaveNet network architecture was proposed. Built on a Sequence-to-Sequence (Seq2Seq) structure, the input text was first converted into a one-hot vector; then an attention mechanism was introduced to obtain a Mel spectrogram; finally, a WaveNet network was used to reconstruct phase information and generate time-domain waveform samples from the Mel spectrogram features. For English and Chinese, the proposed method achieves a Mean Opinion Score (MOS) of 3.31 on the LJSpeech-1.0 corpus and 3.02 on the THCHS-30 corpus, outperforming both Griffin-Lim-based end-to-end systems and parametric systems in terms of naturalness.
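As a rough illustration of the pipeline this abstract describes (a minimal PyTorch sketch, not the authors' code), the snippet below wires together a character embedding standing in for the one-hot text input, an attention-based Seq2Seq model that predicts Mel frames with teacher forcing, and a dilated-convolution WaveNet-style vocoder that maps Mel features to waveform samples. All module names, layer sizes, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Seq2SeqMelPredictor(nn.Module):
    """Attention-based encoder-decoder mapping character IDs to Mel frames."""
    def __init__(self, vocab_size=64, emb_dim=128, hid=256, n_mels=80):
        super().__init__()
        # Dense embedding standing in for the one-hot text input in the abstract.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid, batch_first=True)
        self.attn = nn.MultiheadAttention(hid, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(n_mels, hid, batch_first=True)
        self.to_mel = nn.Linear(2 * hid, n_mels)

    def forward(self, text_ids, mel_targets):
        enc, _ = self.encoder(self.embed(text_ids))        # (B, T_text, hid)
        dec, _ = self.decoder(mel_targets)                 # teacher forcing on Mel frames
        ctx, _ = self.attn(dec, enc, enc)                  # attend over encoder states
        return self.to_mel(torch.cat([dec, ctx], dim=-1))  # (B, T_mel, n_mels)

class WaveNetVocoder(nn.Module):
    """Dilated causal 1-D convolutions conditioned on Mel features; this stage
    replaces Griffin-Lim by generating waveform samples (and hence phase)."""
    def __init__(self, n_mels=80, channels=64, n_layers=8):
        super().__init__()
        self.cond = nn.Conv1d(n_mels, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2,
                      dilation=2 ** i, padding=2 ** i)
            for i in range(n_layers)
        )
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, mel):                                # mel: (B, n_mels, T)
        x = self.cond(mel)
        T = x.size(-1)
        for conv in self.layers:
            x = torch.tanh(conv(x))[..., :T]               # trim right pad to stay causal
        return self.out(x)                                 # (B, 1, T) waveform samples

# Shape check only; real training needs character inputs and ground-truth Mels.
text = torch.randint(0, 64, (2, 30))                       # 2 utterances, 30 chars each
mel = torch.randn(2, 100, 80)                              # 100 Mel frames per utterance
pred_mel = Seq2SeqMelPredictor()(text, mel)                # (2, 100, 80)
wave = WaveNetVocoder()(pred_mel.transpose(1, 2))          # (2, 1, 100)
```

The tanh activation here simplifies WaveNet's gated activation units; the key idea retained from the abstract is that a neural vocoder, rather than Griffin-Lim, performs the Mel-to-waveform inversion.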
Acoustic modeling approach of multi-stream feature incorporated convolutional neural network for low-resource speech recognition
QIN Chuxiong, ZHANG Lianhai
Journal of Computer Applications    2016, 36 (9): 2609-2615.   DOI: 10.11772/j.issn.1001-9081.2016.09.2609
To address the problem of insufficiently trained Convolutional Neural Network (CNN) acoustic model parameters under low-resource training data conditions in speech recognition tasks, a method exploiting multi-stream features was proposed to improve CNN acoustic modeling performance in low-resource speech recognition. Firstly, to extract as much acoustic information as possible from the limited data, multiple types of features were extracted from the training data. Secondly, a convolutional subnetwork was built for each feature type, forming a parallel structure that regularizes the distributions of the multiple features. Then, fully connected layers were added on top of the parallel convolutional subnetworks to fuse the multi-stream features, forming a new CNN acoustic model. Finally, a low-resource speech recognition system was built on this acoustic model. Experimental results show that the parallel convolutional subnetworks make the different feature spaces more similar, and that the proposed model improves recognition accuracy by 3.27% and 2.08% over the traditional multi-feature splicing approach and the baseline CNN system, respectively. Furthermore, when multilingual training is introduced, the proposed method remains applicable, improving recognition accuracy by 5.73% and 4.57%, respectively.
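The following is a minimal PyTorch sketch (not the paper's implementation) of the architecture described above: one convolutional subnetwork per feature stream, whose outputs are concatenated and fused by shared fully connected layers predicting senone posteriors. The choice of streams (filterbank and MFCC feature maps), their dimensions, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """Per-stream convolutional subnetwork over a (freq x time) feature map."""
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),          # fixed-size summary per stream
        )

    def forward(self, x):                          # x: (B, 1, F, T)
        return self.net(x).flatten(1)              # (B, channels * 16)

class MultiStreamAcousticModel(nn.Module):
    def __init__(self, n_streams=2, n_senones=1000, channels=32, hidden=512):
        super().__init__()
        # Parallel subnetworks, one per feature type.
        self.streams = nn.ModuleList(StreamCNN(channels) for _ in range(n_streams))
        # Shared fully connected layers fuse the multi-stream representations.
        self.fc = nn.Sequential(
            nn.Linear(n_streams * channels * 16, hidden), nn.ReLU(),
            nn.Linear(hidden, n_senones),
        )

    def forward(self, feats):                      # feats: list of (B, 1, F_i, T) maps
        fused = torch.cat([s(f) for s, f in zip(self.streams, feats)], dim=1)
        return self.fc(fused)                      # senone logits

# Two hypothetical streams: 40-dim filterbanks and 13-dim MFCCs over 11 frames.
model = MultiStreamAcousticModel()
logits = model([torch.randn(8, 1, 40, 11), torch.randn(8, 1, 13, 11)])
print(logits.shape)                                # torch.Size([8, 1000])
```

Keeping each stream in its own subnetwork, rather than splicing features at the input as in the traditional approach, is what lets the model normalize the per-stream distributions before fusion.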